Classification of heterogeneous text data for robust domain-specific language modeling

Authors

Ján Stas

Jozef Juhár

Daniel Hládek

Abstract

The robustness of n-gram language models depends on the quality of the text data on which they are trained. Text corpora collected from various sources, such as web pages or electronic documents, cover many possible topics. To build efficient and robust domain-specific language models, it is necessary to separate domain-oriented segments from the large amount of text data; the remaining out-of-domain data can then be used only for updating the existing in-domain n-gram probability estimates. In this paper, we describe the process of classifying heterogeneous text data into two classes, in-domain and out-of-domain data, intended mainly for language modeling in task-oriented speech recognition in the judicial domain. The proposed text classification algorithm is based on detecting the theme of short text segments from their most frequent key phrases. In the next step, each text segment is represented in the vector space model as a feature vector with term weighting. To classify these text segments as in-domain or out-of-domain, document similarity with automatic thresholding is used. The experimental results on Slovak language modeling and adaptation to the judicial domain show a significant improvement in model perplexity and improved performance of the Slovak transcription and dictation system.
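The vector-space and thresholding steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the smoothed TF-IDF weighting, the cosine similarity against a single hypothetical in-domain reference text, and the use of the mean similarity as the "automatic" threshold are all assumptions made for illustration, and the key-phrase theme detection step is not covered.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system would
    # need proper normalization for an inflected language such as Slovak.
    return text.lower().split()


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict: term -> weight) per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(counts)
    df = Counter()
    for c in counts:
        df.update(c.keys())
    # Smoothed IDF; one common weighting, chosen here for illustration.
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_segments(segments, reference):
    """Label each segment by its similarity to an in-domain reference text.

    Assumption: the threshold is the mean similarity over all segments,
    a simple stand-in for the paper's automatic thresholding.
    """
    vectors = tfidf_vectors([reference] + list(segments))
    ref_vec, seg_vecs = vectors[0], vectors[1:]
    sims = [cosine(v, ref_vec) for v in seg_vecs]
    threshold = sum(sims) / len(sims) if sims else 0.0
    labels = ["in-domain" if s >= threshold else "out-of-domain" for s in sims]
    return labels, sims, threshold


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    reference = "court judge ruling appeal defendant"
    segments = [
        "the court rejected the appeal of the defendant",
        "the judge issued a ruling in the case",
        "the football team won the match",
        "the weather is sunny today",
    ]
    labels, sims, threshold = classify_segments(segments, reference)
    for seg, label, sim in zip(segments, labels, sims):
        print(f"{label:14s} {sim:.3f}  {seg}")
```

A mean-based threshold adapts to each batch of segments, but it also labels every segment in-domain when all similarities are equal, so a production system would likely need a more careful criterion.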

Similar resources

Topic Modeling and Classification of Cyberspace Papers Using Text Mining

The global cyberspace networks provide individuals with platforms to interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and much more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...

Using Domain-Specific Knowledge to Classify E-negotiations

Texts exchanged in business-related Computer-Mediated Communication, or CMC, differ from texts exchanged in other business situations. CMC data have a high concentration of non-standard textual features. The fast-growing amount of business CMC data offers opportunities for the application of statistical Natural Language Processing and Machine Learning methods, especially for text-classification...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A lot of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

Cultural Elements in the Translation of Children's Literature: Persian translation of Roald Dahl’s Matilda in focus

Translation can have long-term effects on all languages and cultures. It is not a mere linguistic act, but mostly a cultural act, since language is by nature one of the major carriers of cultural elements. Thus, the translator’s job is not just transferring the meaning of words and sentences from the source text to the target text. Culture-specific items often cause translation problems. Identi...

Journal:

EURASIP J. Audio, Speech and Music Processing

Volume: 2014

Publication date: 2014
